PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers

Authors

  • Jaeyoung Choi
  • David W. Walker
  • Jack J. Dongarra
Abstract

This paper describes the Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non-transposed matrix multiplication routine C = A · B, but also the transposed multiplication routines C = A^T · B, C = A · B^T, and C = A^T · B^T, for a block scattered data distribution. The routines perform efficiently for a wide range of processor configurations and block sizes. Together, the PUMMA routines provide the same functionality as the Level 3 BLAS routine xGEMM. Details of the parallel implementation of the routines are given, and results are presented for runs on the Intel Touchstone Delta computer.
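
As an illustration of the block scattered data distribution mentioned in the abstract, the following minimal sketch maps a global matrix element to its owning process and local position. It assumes the usual block-cyclic reading of the term: the matrix is cut into r × s blocks, and block (bi, bj) is assigned to process (bi mod P, bj mod Q) on a P × Q grid. The function names `owner` and `local_index` are hypothetical and are not taken from the PUMMA package.

```python
def owner(i, j, r, s, P, Q):
    """Grid coordinates (p, q) of the process that owns global element (i, j).

    Assumes r x s blocks dealt out cyclically over a P x Q process grid
    (an assumed block-cyclic layout, not code from the PUMMA package).
    """
    return (i // r) % P, (j // s) % Q


def local_index(i, j, r, s, P, Q):
    """Position of global element (i, j) inside its owner's local array."""
    bi, bj = i // r, j // s          # global block coordinates
    li = (bi // P) * r + i % r       # local block row * block height + offset
    lj = (bj // Q) * s + j % s       # local block col * block width + offset
    return li, lj


if __name__ == "__main__":
    # Print the owner of every element of an 8 x 12 matrix
    # with 2 x 2 blocks on a 2 x 3 process grid.
    for i in range(8):
        print(" ".join("%d%d" % owner(i, j, 2, 2, 2, 3) for j in range(12)))
```

With small blocks each process holds pieces from all over the matrix, which spreads work evenly; with larger blocks each process gets bigger contiguous tiles that suit local Level 3 BLAS calls. The block size trades one off against the other.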

Similar resources

Parallel Matrix Transpose Algorithms on Distributed Memory Concurrent Computers

This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD...

A New Direction to Parallelize Winograd's Algorithm on Distributed Memory Computers

Winograd’s algorithm to multiply two n × n matrices reduces the asymptotic operation count from O(n^3) of the traditional algorithm to O(n^2.81), thus on distributed memory computers, the association of Winograd’s algorithm and the parallel matrix multiplication algorithms always gives remarkable results. Within this association, the application of Winograd’s algorithm at the inter-processor leve...

The Spectral Decomposition of Nonsymmetric Matrices on Distributed Memory Parallel Computers

The implementation and performance of a class of divide-and-conquer algorithms for computing the spectral decomposition of nonsymmetric matrices on distributed memory parallel computers are studied in this paper. After presenting a general framework, we focus on a spectral divide-and-conquer (SDC) algorithm with Newton iteration. Although the algorithm requires several times as many floating poin...

A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers

We present a fast and scalable matrix multiplication algorithm on distributed memory concurrent computers, whose performance is independent of data distribution on processors, and call it DIMMA (Distribution-Independent Matrix Multiplication Algorithm). The algorithm is based on two new ideas: it uses a modified pipelined communication scheme to overlap computation and communication effectivel...

Comparison of Scalable Parallel Matrix Multiplication Libraries

This paper compares two general library routines for performing parallel distributed matrix multiplication. The PUMMA algorithm utilizes block scattered data layout, whereas BiMMeR utilizes virtual 2-D torus wrap. The algorithmic differences resulting from these different layouts are discussed as well as the general issues associated with different data layouts for library routines. Results on the...

Journal title:
  • Concurrency - Practice and Experience

Volume 6  Issue 

Pages  -

Publication date 1994